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Abstract 

This paper discusses the calculation of sensitivities, or derivatives, for op- 
timization problems involving systems governed by differential equations and 
other state relations. The subject is examined from the point of view of nonlin- 
ear programming, beginning with the analytical structure of the first and second 
derivatives associated with such problems and the relation of these derivatives 
to implicit differentiation and equality constrained optimization. We also out- 
line an error analysis of the analytical formulae and compare the results with 
similar results for finite-difference estimates of derivatives. We then attend to 
an investigation of the nature of the adjoint method and the adjoint equations 
and their relation to directions of steepest descent. We illustrate the points dis- 
cussed with an optimization problem in which the variables are the coefficients 
in a differential operator. 
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1 Introduction 


This paper discusses the calculation of sensitivities, or derivatives, for optimiza- 
tion problems governed by ODE, PDE, and other state equations. The context 
for this discussion is the general nonlinear programming problem 

minimize F(a) = f(a,u(a)) 

subject to Ce(ci, u(a)) = 0 (1) 

C/(a, 14(a) ) > 0 , 

with the distinguishing feature that u(a) is the solution of some set of equations, 

h(a, t/(a)) = 0. (2) 

For instance, (2) might represent the solution of the boundary-value problem 

-(#„(*)«'(*))' = ?(*), *e[o, l] , . 

«(0) = «(i) = o w 

where the coefficient K a (x) is given, say, by 

n 

I<a{x) = 22 ai<j>i(x) 

1 = 1 

for some fixed set of functions <j >\ , * * * , <j) n . 

While our discussion will focus on the case where the equations defining 
it are differential equations, other defining relations are possible. Problems of 
the form (l)-(2) can appear in discrete event simulation. Another example is 
the sensitivity of eigenvalues and eigenvectors. For instance, if A = A(a) is a 
smooth, n x n symmetric matrix-valued function of a, the system 

Av — Xv = 0 
v T v -1 = 0 

defines an eigenvalue-eigendirection pair u = (A, u). The pair (A, v) is a. smooth 
function of a when A is a simple eigenvalue, and one can apply the formulae we 
discuss here to compute the related sensitivities. 

The equation (2) typically describes the physical state of the problem. Ex- 
amples of optimization problems governed by state relations abound in inverse 
problems, parameter estimation, remote sensing, optimal design, and optimal 
control. We will refer to the variable a a is the model parameters and to u(a) as 
the state associated with a. The governing equation (2) will be called the state 
equation. 

We will examine the calculation of the derivatives associated with the prob- 
lem (1). We will henceforth ignore the constraints Ce and Cj in (1) and consider 
the ostensibly unconstrained problem 

minimize F(a) = f(a,u(a)) (4) 
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and study the derivatives of F and u with respect to the variable a, since the 
derivatives of Ce and Cj with respect to a are similar to those of F . This 
simplification helps us focus on the salient feature of u(a): its nature as the 
solution of (2). 

Our discussion of the calculation of sensitivities is motivated primarily by 
an interest in applying nonlinear programming algorithms to (1). The most 
generally effective optimization algorithms for problems such as these are quasi- 
Newton methods [8, 11], which require derivatives of the the objective function 
F and the constraints. Sensitivities are also useful in their own right to study 
the dependence of the state, objective, or constraints on the parameters a. As 
we shall see, the governing equation (2) imparts a great deal of structure to the 
calculation of derivatives. 

The goal of this paper is to interpret the language that one encounters in the 
literature on calculating sensitivities for differential equations in more familiar 
terms, and, in particular, to show the connections to classical ideas in nonlinear 
programming. Because we have in mind the optimization of systems governed 
by differentia) equations, we will frame our discussion in the general terms of 
functional analysis. 

The main theme of this paper is the systematic approach to computing 
derivatives based on implicit differentiation, and the significance of these deriva- 
tives for optimization. Among the particular points we will discuss are the 
following: 

• A careful derivation of the general formulae for the first and second deriva- 
tives of F. including the infinite-dimensional case. 

• The connection between the formulae for derivatives and equality con- 
strained optimization. 

• A comparison of numerical error estimates for sensitivity calculations via 
analytical formulae and by finite-differences. 

• The distinction between the derivative and directions of steepest descent. 

• The adjoint approach, and the sense in which the “adjoint equations” are 
adjoint. 

• Some potential difficulties with the adjoint approach in the context of 
optimization algorithms; in particular, how it may correspond to a non- 
standard choice of scaling for some problems. 

This exposition is intended partly as a primer for those unfamiliar with this 
type of sensitivity calculation, and partly to make it easier for those whose 
background lies in differential equations and those whose background lies in 
nonlinear programming to discuss optimization problems of mutual interest. 

The problem that is tacitly assumed as the model problem in this paper is 
the case where u(a) is the solution of a differential equation and a represents 
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either data in the problem — boundary values and source terms — or coefficients 
in the differential operator. Such problems make up a large proportion of those 
encountered in control and parameter estimation. One topic that vve will not 
discuss is shape optimization, in which the domain on which the state is defined 
varies, since this topic requires a great deal of machinery unlike that developed 
here. However, many shape optimization problems can be reduced to problems 
where the domain of definition for u is fixed and the variation in the shape is 
represented by the variation of some boundary term or coefficient, in which case 
the approach discussed here applies. For examples, see [23]. 

We begin in §2 with the derivation of formulae for derivatives. The results in 
this section are certainly not new, but the emphasis placed on the role of implicit 
differentiation may be unfamiliar to some, and the presentation is detailed, 
particularly the general derivation and interpretation of the reduced Hessian, 
which relies on various technical identifications relegated to §11. 

In §4 we present some error analysis for the calculation of sensitivities via 
the analytical formulae and compare the results with similar results for finite- 
difference estimates of derivatives. This comparison helps explain the often 
noted experience that analytical derivatives can be much more accurate than 
finite-difference approximations of sensitivities for systems governed by state 
equations. 

In §5, we discuss the relationship between the formulae for derivatives and 
equality constrained optimization. Here we examine what is called the adjoint 
state or costate in the differential equations and control literature and identify it 
as a familiar Lagrange multiplier estimate in linear and nonlinear programming. 

In §G and §7 we discuss two approaches to sensitivity calculations. In prac- 
tice, these approaches differ in the way in which they organize the intermediate 
calculations. The first is the sensitivity equations approach, which yields direc- 
tional derivatives. The second is the adjoint equations approach, which is an 
attempt to represent the derivative in a particular form and obtain a direction 
of steepest descent by inspection. Our discussion is based on the distinction 
between the derivative, which is a linear functional and as such lives in the dual 
of the domain on which the problem is posed, and directions of steepest descent, 
which are vectors in the domain that depend on a choice of norm. In IR n linear 
functionals are simply row vectors that may be transposed to obtain a direction 
of steepest descent. However, in the infinite-dimensional case the situation is 
more complicated. This we also discuss in §7, where we clarify what is “adjoint” 
about the adjoint equations in the context of optimization, and how the adjoint 
equations are related to a choice of norm, or scaling, defining a direction of 
steepest descent. 

We illustrate the discussion with a parameter estimation problem for an 
elliptic operator in §3 and §8. This example suffices to show how one computes 
first and second derivatives and directions of steepest descent with respect to 
different norms. This example also shows how one can go wrong by an uncritical 
use of the adjoint equations when they correspond to an unsuitable scaling for 
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the problem. 


2 Formulae for the derivatives 

We begin with the analytical formulae for derivatives for problems governed by 
state equations. These derivatives of the state and objective will follow from 
implicit differentiation. These formulae are derived in detail in order to be 
precise about the exact nature of the quantities that appear in the infinite- 
dimensional case, particularly in the expression for the derivative and Hessian 
of the objective. 

2.1 Notation 

Given a Banach space A\ we will denote its dual, the space of all bounded linear 
functionals on A', by A’'. We will denote the duality pairing between T E X ( 
and v E A" by Tv = (T, v) } or by (T, v) x if it is desirable to note the space 
involved. If A’ is an inner product space, we will denote by (♦ , *) or (• , • ) x the 
inner product. Given two spaces A* and V\ L{ A\V) will denote the space of 
bounded linear maps from A’ to Y. We will denote by I\ the identity operator 
on A'. 

The adjoint of a bounded linear operator A : A' — y Y will be denoted by A x . 
The adjoint A x : Y' — > X' is given by 

(A x y, x) x = (j/, Ax) y . % i € V". 

If X and Y are both Hilbert spaces, we will identify ,4 X with the Hilbert space 
adjoint ^4* : Y -4 A , defined by 

{x , A‘y) x = (Ax , y)y 


for all x E X and t/ E 1\ 

Given a map G : X — ¥ Y , we will sometimes denote its first and second 
derivatives at ;r by DG(x) and D 2 G(x). In the proof of Theorem 2.2 we will 
need to distinguish between the dependence of the derivatives DG and D 2 G on 
x and their action on vectors, which we will do by using brackets to delimit the 
arguments of DG and D 2 G as linear and bilinear maps: DG[v] = DG{x)[v\ and 
D 2 G[ t?i, vo] = D 2 G(x)[ VhVi]. 


2.2 The implicit function theorem and implicit differenti- 
ation 

The classical implicit function theorem [14] will suffice for the calculation of 
sensitivities in this paper: 
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Theorem 2.1 (The Implicit Function Theorem). LetX. U, andV be 
Banach spaces , and suppose h is a mapping from an open subset S of X x U 
into V. Suppose (ao,«o) a point in S such that 

1 . A(a 0 ,tio) = 0. 

2. h is continuously Frechet differentiable at (oo,«o)? ar? d 

3. the partial Frechet derivative dh/du(ao, wo) ts bounded ly invertible. 

Then there exists a neighborhood S of ao such that for each a £ XL the equation 
h(a } u) = 0 25 solvable for t/(a) £ {7. Moreover , the derivative of this solution 
u(a) with respect to a is given by 

du _ ( dh A ~ 1 dh 

da \du) da 

This formula for the Jacobian of u with respect to a is formally the result of 
applying implicit differentiation to h(a, u(a)) = 0 to obtain 

dh dh du 

da du da 


and thence (5). 

2.3 The reduced derivative and the reduced Hessian 

We will now apply the Implicit Function Theorem to derive formulae for the 
derivative and Hessian of the objective function F in (1). We will assume that 
?/. ( a ) is a locally unique solution to 

h(a, u(a)) = 0, (6) 

where h : (a,u) £ A’ x U — y V , and that dh/du is boundedly invertible. In 
practice, the validity of these hypotheses typically follows from the existence 
and uniqueness theory for the solution of the equation represented by (0). We 
will also suppose that / and h are twice continuously Frechet differentiable on 
a neighborhood of (a,u(a)). 

Let 



We will call IT the injection operator since it is a one-to-one mapping from A’ 
into A' x U and is invertible on its range; in finite dimensions it is a full rank 
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matrix. Its adjoint W x we will call the reduction operator. Observe that the 
range of W lies in the nullspace of the Jacobian of h: 


, TI . fdh 8h 
n «,u) hW -\da' du 


IX i j 

dhy dh = o. 

du ) da / 


Also define A E V* by 


, - -®L ( — 

du \du 


and the Lagrangian t( a , u; A) by 

f(a, u; A) = f(a. u) + (A, h(a. u)) v . 

The Lagrangian is normally associated with constrained optimization, a point 
to which we will return in §5, where we will discuss the nature of A as a Lagrange 
multiplier estimate known as the costate or adjoint state. 

Theorem 2.2. The derivative of F with respect to a is given by 

= ? I , (10) 


da du \du J da l(a ( u(a)) 


which may also be written as 
F'(a) = D( aM )f W 


l(a,u(a)) 


= D( a u )((a, u; A) W 


i (a,u(a)) 


where A = A(a,tx(a)). The Hessian of F is given by 


V 2 a F(a) = VL X (vf a u) f((a.«(a);A)) W 
V 2 (a .„/((<*, u; A) = V^ u ./ia. u) + (a, D^ u) h{a, u)) y . 


where 


The term ^A, warrants explanation. Since u ^h(a, u)[vi, v 2 ) € V 

for n, t ?2 G A' x {\ we have a real- valued bilinear form defined by 

(a. Vf 0 , 1 *) v [»„»,] = (a, 

In the finite-dimensional case, h = (hi , . . . Ji m ) T and we have the more recog- 
nizable quantity 
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Theorem 2.2 reduces to familiar results from nonlinear programming in the 
finite-dimensional case. Assuming vectors in IR n to be column vectors, formula 
(10) in Theorem 2.2 is an expression for a row vector (a linear functional on 
1R 11 ). We transpose to obtain the gradient: 

v a F=w T v {atU) e. 


The objective F(a) — /(a,u(a)) is called the induced objective ; we obtain the 
gradient V a F of the reduced objective by applying the reduction matrix W T to 
V( a „)/■ This is an instance of the reduced gradient in nonlinear programming 
[li], For this reason we will call dF/da the reduced derivative. Similarly, the 
expression (12) corresponds to the reduced Hessian : 

V 2 a F = w T v 2 au) c w. 

The reduced gradient and the reduced Hessian and the origin of the terminology 
“reduced” will be discussed further in §5. 

The proof of Theorem 2.2 is a straightforward calculation based on implicit 
differentiation. The one subtlety is the interpretation of some of the quantities 
encountered along the wav in order to arrive at (12), which looks like the familiar 
formula for the reduced Hessian. For instance, V 2 F = W x V 2 £ W means that 


v 2 Ffo,,ife] = v 2 e[w m , wth] = v 2 e[( m , (t, 2i ^7,2)] 


for all 7)\,t)2 E A\ The identification of this latter formula with (12) requires 
the results in §11. 

Proof. Computing the derivative of F, we see that 


dF t x df , , df , , ^du, x 


From this and the Implicit Function Theorem we obtain the following expression 
for the derivative of F: 


^~( a ) = !£( a • u ( a )) ~ «( a )) u ( a ))) «(«)). 


da 


which is (10). This can be rewritten as 

s«") = w = {%■ ji) w - 


this and (8) yield (11). 

We now turn our attention to the Hessian. We have 

d 2 F _ d 
da 2 da 


f a {a,u(a)) + f u (a,u{a)) 


du 

da 


i 



in the sense that for all 77 *, 772 € A' 


d 2 F d 2 f r , df a ,du 

d/ u r du , <9 2 / r du du d 2 u 

+ a7 [ ' ?1 ’ 5a ^ + fe 171 * 55' 721 + /u ^ [,?1 ’ r?2] ’ 


where the partial derivatives on the right-hand side are evaluated at (a,u(a)). 
Here we are using the identification of Hessians and bilinear maps in §11.2. 
Using the interpretation of adjoints and bilinear forms in (54) in §11.4, we can 
rewrite this as 


d 2 F 



( d 2 f d 2 f \ 


( \ 

(j d Jt x ) 

da 2 duda 


I 

\ da ) 

d 2 f d 2 f 


du 


\ dadu du 2 / 


V da / 


U 


* ( V (a,u)/) H + du 


df d 2 u 
du da 2 


df d 2 u 
+ du da 2 


(13) 


Meanwhile, implicit differentiation of 


du 

h a {a , «(a)) + A u (a,«(a)) — (a) = 0 


vields 




dh 

du 


1 (d 2 h r . dh a du , dh u du d 2 h du du 

(.a? 1 ’ 1 ' mi + + Ta’ l2] * ft? 1 Ta"" Ta m] 


for all r?i , ?/2 E A', so 

df d 2 u 
duda 

. fd 2 h r , d/? ar du dh u du d 2 h du du 

= x + *-[*■ 5JU1 + apljl’n. *■») 


Since the right-hand side is a real- valued bilinear map. we may again apply (54) 
in §11.4 to rewrite this as 


555^1’ ^ = ( iyx ( Al u ) [% ’ (14) 

Combining (13) and (14) yields (12). □ 
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3 Example 

We will apply Theorem 2.2 to compute the derivative for a least-squares func- 
tional associated with the following boundary value problem (BVP): 

—X? (aVu) + bid Tl u = q in Q 

u = 0 ondQ. 1 j 

We assume Q is smoothly bounded. We use the summation convention through- 
out; if an index occurs twice in a quantity then summation over that index is 
implied: bjd Tl u — bid x ,u- For simplicity, we will assume that a = a(x) is 

a scalar function. We will assume, too, that b{,q E T°°. Existence, uniqueness, 
and regularity of solutions of this problem are discussed in [10, 17]. 

For simplicity, we have chosen a problem for which the state equation is 
linear in the state and the boundary values are homogeneous. We will consider 
the following nonlinear least-squares functional: 

minimize F(a) = | / dx (w(x) — u*(x)) 2 , 

Jn 

where u. E L°° . For instance, this objective might represent a parameter es- 
timation problem, in which case the data would represent observations the 
mismatch with which we wish to minimize. For a further discussion of the pa- 
rameter estimation problem, see [3, 15, 26] and the references therein. This 
functional could also arise in inverse design, where u* would represent some 
desired state that we are attempting to achieve by varying a. Our goal here is 
only to study how one computes derivatives, and we will ignore the question of 
the existence of solutions to the minimization problem. 

We will consider weak solutions to (15). For now we will let A' = L°°(fi), 
though later we also consider the case where A' = C k,c *, the space of C k functions 
with Holder continuous derivatives of order a. A suitable domain for a is 

5 = { a € I j a > a* > 0 } 

for some positive a* E IR . The state u resides in U — Hq(£1). 

The weak interpretation of the BVP (15) means that the state constraint h 
is a map 

h : (a, u) E S x U -4 h(a , u ) E V = (Hq (Q)) ; 
where for v E H$(Q), 

(h\a, u), v) H i — I dxaVu-Vv - f / dx (bid Xl u)v — dx qv. (16) 

0 Jn Jn Jn 

The relation that defines u as a function of a is h(a,u(a)) — 0 in (Hq(Q))'. 
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We begin by computing the various quantities needed to apply Theorem 2.2. 
Since h is an affine function in u, it is Frechet differentiable with respect to u. 
Computing 

dh h(a,u + tv) — h(a,u) 


we find that 


-s- V- lim 

OU (->-0 


= .{aS}v) + hid x v 

OU 


in (H^{Q))\ in the sense that 


= I dxaVv Vu + dx (bid Xl v)v. 
Hi -'n 


In a similar way we obtain 


-77 = —V • (r/Vu) 


Again, this equality is to be interpreted in the weak sense, as elements of 

(nh( «))'• 

Both (17) and (18) are expressions for a Jacobian-vector product — a direc- 
tional derivative — rather than an explicit formula for the Jacobian. Directional 
derivatives such as these are straightforward to compute. 

Following the program in §2. we wish to apply implicit differentiation. First 
we check that dh/ du is boundedly invertible, that is, that for all $ € . 

there exists a weak solution v E of the linearized boundary- value problem 


— v — -V • (aW) + b{d Tt v - $ in Q 
du ^ 

v = 0 on <9D, 

and that the solution operator is bounded: there exists C. independent of <F, 
for which 

II v llni(n) ^ ^11 ^ ll(K^(n)) / ■ 

In this case, the bounded invertibility of dh/du follows from the existence theory 
for elliptic equations in divergence form [10, 25]. 

Thus we may apply the Implicit Function Theorem to conclude that the 
action of du/da — the Jacobian of u with respect to a — on a vector 77 is given by 
the solution of the linearized BVP 

Lv = -V ■ {aVi/) + bid Xl v = V ■ (tjVu) in Q . . 

i/zzQ on dQ. 1 } 


This corresponds to 


dh 1 dh 


— ^ = -—77 or v - — 77-— 77 

du da du da da 
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in the notation of §2. 

We now arrive at the action of the derivative F f (a) on 77 . Let 


v — 



v is defined by (19). We also have 



Then by (10), we have 


F'(a)r ) = [ 


dx (it — u«)v. 


This vields the action of F f (a) as a linear functional. 


( 20 ) 


4 Analytical vs. finite-difference approximation 
of sensitivities 


In this section we will draw some comparisons between the numerical accuracy 
of the analytical derivatives of §2 and that of finite-difference estimates. We will 
consider the case where the state equation is linear in u : 


h(a , 1 /) = .4(a) u — 6 = 0. 


Given a — (ai , - * * , a n ), we compute the matrix A(a) and solve the system 
An — b for w(a). For instance, such a linear system would arise in the solution 
of a boundary-value problem such as (3) or (15). As we shall see, the error 
estimates are guided by the fact that small changes in a will generally cause 
only small changes in .4, but. if the system is ill-conditioned, may cause much 
larger changes in v. 

Let’s see what might happen if we apply finite-differences to compute the 
partial derivative 

,, , _ du 
u (a) = 

which is the i ,h column of the Jacobian of u with respect, to a. 

We will need the following basic estimate concerning the sensitivity of the 
solution of linear systems to changes in the data, adapted from [13]. Let xc(A) 
denote the condition number of A. 

Theorem 4.1. Suppose A 6 JR nxn is nonsingular, b £ IR n , Ax = 6, and 
suppose (A 4- AA)y = 6 + A6, where || A -1 || || A A || < 1. Then 


\\^ 1 y\\ < 1 

11*11 - 1-IM-MIIIAX 


i-i 


A6 


+ I A~ 


AA 


( 21 ) 
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Moreover . if \\ AT || < £ || A || and || A6 || < e || 6 ||, there are perturbations for 
which this bound is achieved to first order in e. 

Of course, this hound is quite pessimistic for most perturbations. For in- 
stance, a small perturbation of the form AT = aT is benign, and its effect does 
not involve k(T). On the other hand, there are perturbations for which these 
bounds are nearly obtained, which is of significance to us. Moreover, if A has 
a certain sparsity pattern — say, if A were associated with a finite-difference or 
finite-element scheme — the perturbations AA that produce this sensitivity can 
have the same sparsity pattern as .4. 

Let be the i th standard basis vector. We will assume that a,* as 1, and 
consider the effect of a finite-difference step t « ra 2 , where t reflects the absolute 
size of the step and r the relative size. We will use p to denote machine epsilon, 
the smallest floating-point number for which 1.0 + p — 1-0 (in floating-point). 

Let u*(a) be the solution to the linear system A(a)u = 6 computed in exact 
arithmetic, while u(a) will be the computed solution. Let e(a) — u(a) — u*(a) 
be the associated error in the solution; we will assume that u is computed 
as accurately as possible, so that || e( ) || = 0(k(A)p). We will assume that 
k(A)p 1 so we can ignore the issue of numerical singularity. 

As we saw in (5), the exact partial derivative it' (a) is the solution of 

T*(a)u'(a) = — — — (a)u* (a), (22) 

aai 

where the subscript V on the matrices denotes their representation in exact 
arithmetic. The computed partial derivative u'(a) is the solution of 

dA 

A(a)u'(a) = ti(a), (23) 


where the matrices are the floating-point representations of the exact matrices. 
Comparing (22) and (23), we expect || AT || = || .4 ( a ) - T*(a) || < p || T*(a) ||. 
while the change in the right-hand side is 


A6 = 


dA. 

dai 



dA 

u*(a) T (w.(a) - «(«)) , 


from which we obtain 

II II < A* 

< //(1 + k(T)) 


dA, 


d. A, 

dai 

II w.(o) II + 

dai 


dA . 
dai 


u * ( a ) 


We will now make the assumption that 


SA. , 


dA. 



da-i 


u* (a) — u(a) 


u*(a) 


(24) 
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where here « means equivalence up to a factor that is small by comparison to 
k(A). Under this hypothesis, combining the preceding estimates according to 
(21) we see that computing u f via the analytical formula satisfies a relative error 
estimate of the form 


II «U«) - «'(«) II 
II <( a ) II 


0(k 2 (^). 


( 25 ) 


This suggests that computing u f via the analytical formula is comparable in 
condition to solving least-squares problems. The factor k 2 (A) is not entirely 
unexpected, since the calculation of u' involves the solution of two linear sys- 
tems. one for u and then another for u* . 

Next consider the finite-difference approximation and its two sources of error: 
truncation error, due to the nonlinearity of the function being differentiated, and 
condition error, due to inaccuracies in computing the function [11, 20]: 


^/(a + tei) — u(a.) 
- — 


\l {a) 


(u.{a + tei) - u»(a) , e(a . + <e,) - e(a) 

t 7 u(a) J + 7 

truncation error + condition error. 


These are the Scylla and Charybdis of finite-difference approximations, since 
reducing one error tends to increase the other. 

Under our hypotheses, the relative error due to condition error satisfies 

dA m 

dai 

In practice, condition error is exacerbated by the use of iterative solvers in the 
solution of the state equations, among other things. In particular, the stopping 
criteria for iterative methods increases the condition error: consider solving a 
discretized differential equation, where u would represent a discretized function. 
The iterative approximation of u might be abandoned when the error in the 
computed solution is believed to be comparable to the error inherent in the 
level of the discretization [21], rather then when the relative residual of the 
system being solved has been reduced to the order of floating-point precision, 
thus increasing the condition error. However, here we will restrict our attention 
to the errors solely attributable to the conditioning of the state equations. 

Now consider the truncation error. In practice, analytical nonlinearity in u 
may be amplified by numerical nonlinearity. For instance, numerical methods for 
the solution of differential equations that contain switches such as upwinding will 
contribute to the nonlinearity of the dependence of u on a. If we were applying 
finite-differences to estimating dF/da t in (1) and avoiding the intermediate 
state u, then we might also have to contend with adaptive meshing methods 
that could change the state space as a function of a, another contribution to 
truncation error. Again, we will restrict ourselves here to the effects of the 
condition of the state equations. 


c(q + te i) - €{a) || || e(o) jj k(A)h || v*(a) || ^ k(A)/i 

<||<(a)|| ~ 1 1| «i(o) || - 1 1| m' (a) || ~ t 
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We have 


A. {a + tei) - -4. (a) 5.4 „ 

7 = ~d^ (a) + E - 

We may expect E to be small relative to .4(a) if 4 depends in a straightfor- 
ward manner on a. For instance, for the example (3), the discretized operator 
constructed for a finite-difference or finite-element scheme would be a relatively 
simple algebraic function of the coefficient parameters a. For convenience, define 

ti,(a + tei) - u.(a) 
u„(a) = - . 


Then, 

A.(a)u',(a) = — f^b(a) + u,(a + tei). (26) 

Meanwhile, consider A.4 = A. (a+ fe<)- A. (a); we expect || AA || « r|| .4. (a) ||. 
and the estimate (21) yields 


|| u,(a + tei) ~ u»( fl ) II < rK (- 4 >( a )) 

|| u.(fl) II 1 - tk(A.)' 


(27) 


Comparing (22) and (26) using the perturbation estimates (21) and (27), we 
obtain 

II <( a ) II 

Combining the bounds on the condition and truncation errors, we obtain a 
bound of the following form on the relative error in the finite-difference estimate: 


u(a -I- tei) - «(<0 


u [a) 


/ || «',(«) II < ci k 2 (.4(«))t + 


c 2 K(4(a))p 


Minimizing this in r gives a bound that is 0(K 3 ^ 2 (A)fj 1 ^ 2 ). In view of our 
hypothesis k(A)^ <§: 1, this bound is much more pessimistic than the 0(k 2 (A)h) 
bound on the analytical derivative, itself no great shakes. 

This analysis suggests finite-difference approximations of derivatives associ- 
ated with state equations are potentially much more sensitive to ill-conditioning 
of the state equations than are derivatives calculated using the analytical for- 
mulae. Whether or not one sees these pathologies depends on the condition of 
the system being solved and the the perturbations of that system caused by 
changes in the design variables a. And. as we have noted, the analysis sketched 
here also ignores other sources of error that one encounters in practice that can 
have an even more pronounced effect. 

While in practice one can generally use finite-differences successfully, there 
remains the possibility for serious and unavoidable errors. One can construct 
algorithms for unconstrained optimization problems using inexact gradients 
[5, 22], but errors in the gradient can retard progress. Inaccurate derivatives are 
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also a problem for sensitivity analysis in design (i.e.. approximating the local 
behavior of a function about a nominal design using a first-order Taylor's series 
model). The potential for unpredictably inaccurate finite-difference approxima- 
tions of sensitivities is one motivation for examining analytical techniques for 
computing derivatives. 


Relationship of the sensitivity calculations to 
equality constrained optimization 


In §2.3 the Lagrangian 

C(a , u; A) = /(a, it) + (A, h(a , «)) 
was introduced with the multiplier A £ V* defined by 


X -J1 

A du 


dh\ 

du) 


-1 


(28) 


The motivation for introducing the Lagrangian comes from viewing the problem 
( 4 ) as an equivalent equality constrained problem: 

minimize f(a, u) ^ 9 ) 

subject to A(a,«) = 0 , 

where now both a and u are independent variables. From this point of view 
the costate A serves as a Lagrange multiplier estimate [11. 24]. The assumption 
that dh/du is boundedlv invertible allows us to invoke the Karush-Kuhn-Tucker 
necessary conditions for a feasible point (a* , i/*) to be a solution of (29) [7]: there 
exists A* £ V f for which 

D^ a u )({cLm , u* ; A«) = D(a,u)f (o* ♦ w*) T (A* . D ( a ' ^ * ) ) 0. 

In particular, the u-component of this system is 

8f , , x dh t , n 

— a*,u* ) + A.— — 0 . 

du ou 

From this and the definition of the costate (28) we see that A is an estimate of 
the Lagrange multiplier associated with (29) that, is consistent with the first- 
order conditions at a locally constrained minimizer: i.e., A = A* at a minimizer. 
A further discussion of the topic of multiplier estimates can be found in [11, 24]. 

The costate A corresponds to two common multiplier estimates in linear 
and nonlinear programming, the shadow costs or reduced costs in the simplex 
method [ 6 ] and the variable reduction multiplier estimate in nonlinear program- 
ming [11]. To see this correspondence, first consider the Jacobian of the state 
constraints in the finite-dimensional case: 

^ “Wv, 

da du ) 


B). 
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We are assuming that B - dh/du is boundedlv invertible, so we may take the 
corresponding variables, the state variables it, as the basic variables (so-called 
because the columns of B form a basis) and the model parameters a as the 
nonbasic variables. Then \ T = B~ T V u f . 

Now consider an iteration of the simplex method for the linear programming 
problem 

minimize c T x 

subject to Ax = b 

Xl < % < - 

One determines the components xjv of x for which the inequality constraints 
are binding, and forms an invertible block B from the columns of A correspond- 
ing to the remaining components x#, and a vector eg from the corresponding 
components of c. The shadow costs n are then defined to be 7 r = — B~ T eg , 
corresponding to the costate A. 

In the case of nonlinear equality constrained programming, 

minimize f(x) 
subject to h(x) = 0, 

the variable reduction multiplier estimate at x is computed by first finding an 
invertible block of columns B of the Jacobian of h. The multiplier estimate is 
then 7 r = B~ t \~ bI( x )' where V J g/(x) are the corresponding components of the 
gradient , and again we see the correspondence with A. 

The basic/nonbasic partition comes about by viewing the basic variables as 
functions of the nonbasic variables. This reduces the problem to one in the 
nonbasic variables alone; hence “variable reduction," “reduced gradient/' and 
“reduced Hessian." In the case of state constraints, we can treat the state u as 
a function of a in (29) and eliminate u as an independent variable to obtain (4). 
The costate multiplier is derived from a fixed partition of the variables in which 
the state variables are always the basic variables and the model parameters a 
are always the nonbasic variables. This is unlike the general case of linear and 
nonlinear programming, in which the basic and nonbasic partition tends to vary. 

In the nonlinear programming literature, this relation between equality con- 
strained optimization and systems governed by state relations goes back at least 
to [1] and work cited there, where it is discussed in the context of the general- 
ized reduced gradients algorithm. Further consequences of the basic/nonbasic 
partition of the state and model variables can be found in [18]. 

6 Sensitivity equations vs. adjoint equations 

The order of calculation in (5) and (10), which we followed in §3, corresponds 
to the approach to computing derivatives known as the sensitivity equations , 
as well as computing sensitivities via finite- differences or the forward mode of 
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automatic differentiation [4]. The sensitivity equations approach is equivalent 
to computing directional derivatives, and for this reason it is most applicable 
when there is a small number of design parameters a. 

The following example makes the idea clear. We modify our example (15), 

— V • (K a Vu) -f bid Xt u = q in ft 
u — 0 on 

so that the coefficient in the leading term is parameterized as a function of a 
set of model parameters a = (a*): 


n 

K a = 22 ° i ® i 

1=1 


for some (small) set of basis functions { <2>i , • * • , <p n }. 

Formally, the sensitivity equations are derived by applying d/dai to the 
governing st ate equations and interchanging the order of differentiation to obtain 
a relation defining du/dai'. 


-v '(^f v “ )_v '( /CaV £ ) + 6Ai £ L = 0 lnQ 

p- - 0 onSO 

da i 


(30) 


In terms of the discussion in §§2-3, this is nothing other than implicit differen- 
tiation of h(a r u(a)) — 0 to obtain 


dh dh du 

dai du dai 


The sensitivity equations yield du/dai. If we wish to compute dF/da{ for some 
functional F{a) = f(a.u{a)). we would use du/dai and the chain rule. 

The sensitivity equations approach is attractive when one has a large number 
of outputs but only a relatively small number of inputs. Suppose we wish to 
compute sensitivities not just for a scalar output F, such as the objective in 
(1). but a vector-valued function C(a) = c(a,u(a)), where c : lR n x ZR m -4 IR q , 
such as the constraints in (1). The Jacobian of C is given by 


dC_ 

da 


dc dc du 

da du da 



qxm 


(31) 


In the sensitivity equations approach, we tacitly compute du/da as an interme- 
diate quantity, which requires n solutions of the sensitivity equation, no matter 
the number of state variables u or outputs C . We compute an entire column of 
the Jacobian of C each time we solve the sensitivity equations. 
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On the other hand, if one has a relatively large number of inputs, the sen- 
sitivity equations may not be practical, since every partial derivative requires 
the solution of the sensitivity equations (i.e., the linearized state equation (30)). 
This motivates the adjoint approach. 

Transpose (31): 


VC(«) 



77 x m 


mxm 



(32) 


where VC denotes the transpose of the Jacobian. Then we see that this trans- 
posed sequence of operations requires q solutions of the transposed linearized 
state equations (q applications of (dh/du)~ T ). If q n, this will be preferable 
to the expense of the sensitivity equations approach. This ordering of operations 
is the gist of the adjoint approach and the reverse mode of automatic differen- 
tiation. In the case of IR n \ the adjoint corresponds to the matrix transpose. 

For an optimization problem, the adjoint equations approach — ordering the 
calculation of derivatives as in (32) — is very attractive because one obtains the 
gradient of the objective F. disirregardless of the number of model parameters 
a. via a single application of the transposed solution operator (Oh/du)" 7 . More 
generally, the effort required to compute sensitivities (say, of constraints) via 
the adjoint approach grows with the number of outputs rather than with the 
number of inputs. 

The adjoint approach requires us to solve linear systems involving (dh/du)~ T . 
If we have dh/du at hand as a factored matrix this is not ail that difficult. How- 
ever. dh/du might not be readily available, say, if h{a,u(a)) = 0 is solved via 
a nonlinear fixed-point iteration, or only the action of dh/du is available be- 
cause systems involving it are solved using an iterative scheme. In either case, 
implementing {dh/du)~ T will require a fair bit of effort on the part of the user. 

In the finite-dimensional case the sensitivity equations and the adjoint ap- 
proach are simply two different ways of computing a product of matrices. De- 
pending on the relative dimensions of the matrices, one or the other method 
will be the more attractive. However, in the infinite-dimensional case, the situ- 
ation is more subtle. The complication arises in the switch from row vectors to 
column vectors in the adjoint approach, i.e., the transposition of (31) to obtain 
(32), the significance of which we will now discuss in greater detail. 


7 The representation of derivatives and the ad- 
joint approach 

We have seen that the attraction of the adjoint approach in finite-dimensional 
optimization is that one obtains the gradient of the objective for the cost of 
solving a single linear system. Abstractly, the derivative F f is a linear functional 
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on IR n , while the gradient — the direction of steepest ascent — is a direction in 
IR n . We can pass between the two because of the identification of IR n and its 
dual, which does not necessarily generalize to the infinite-dimensional case. The 
derivative of F described in Theorem 2.2 resides in the dual A'', and we cannot 
necessarily identify A' with X. We can connect the tw'o spaces through the 
notion of a descent direction — a direction p £ X for which F'(a)p < 0. At the 
very least, such a direction is needed in order to apply a quasi-Newton method. 
This leads us to a discussion of directions of steepest descent, the representation 
of linear functionals, and the adjoint equations. 

7.1 Directions of steepest descent and the action of the 
Hessian 

First recall the definition of a direction of steepest descent [12]. Suppose A' is 
a normed linear space with norm || * || A , and suppose F : X — > 1R is Frechet 
differentiable at a with Frechet derivative F'(a) £ AT Then the direction of 
steepest descent w r ith respect to the norm || - ||^ is a solution of the problem 

minimize ( F'(a ), p) 

subject to || p \\ x < 1, 

provided that a solution to this minimization problem exists. In the case of a 
reflexive Banach space, we are guaranteed at least one solution to (33) because 
the unit ball B will be weakly sequentially compact [27]. Given any sequence 
{Pk}* || Pk || < T for which 

lim <F'(a), p k ) = L = inf <F'(a),p), 

k >■ OO i|p||<l 

the weak sequential compactness means that we can find a subsequence con- 
verging to a point p* for which { F r (a ), p*) = L. 

Note that the direction of steepest descent depends on choice of norm — the 
direction of steepest descent indicates the direction of greatest decrease in F 
per unit distance, and the distance depends on the norm. The derivative is a 
linear functional independent of choice of norm; the direction of steepest descent 
depends on what one means by ‘"steepest”. A short step in the L 2 norm may 
not be a short step in the H 1 norm, for instance, since an oscillatory function 
may have a small L 2 norm but a very large H 1 norm. This aspect of the choice 
of norm has practical bearing on the behavior of optimization algorithms. The 
choice of norm — the scaling — can have a profound impact on the efficiency of 
optimization algorithms [8, 11]. 

A similar concern arises in interpreting the action of the Hessian H = V 2 F. 
The Hessian is an element of the space L(X,X') (§11.2); accordingly, the 
Hessian- vector product Hp is an element of AT and again we ask how this 
linear functional can be related to directions in X. As with the direction of 
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steepest descent, a natural problem to pose in order to represent the Hessian- 
vector product Hp as an element of A is: 

minimize (Hp, q) 

(34) 

subject to || q || < 1. 

In the case of a Hilbert space, we have X f % A' and L( A\A') « L(X, A'), so 
there is an immediate interpretation of Hp as an element of A’. In this case, 
the solution q of (34) will point in the direction of - Hp, 

The conjugate gradient algorithm illustrates the preceding discussion. Con- 
sider the minimization of the quadratic form 

q(x) = ^x T Ax — x T b , 

where A £ ZR nxn j s symmetric positive definite. Following [9], we can summa- 
rize the conjugate gradient algorithm as follows: 

xq = 0, f’o = b, k = 1 
while h--i ^ 0 { 

get dk such that djr k -i ^ 0 
x k - argmin q(x) 

r€span{pi, ,p*-i ,dk } 

Pk = Xk - Xk-1 
r k = f'k — i - Apk 
k = k+ 1 

}• 

In the un-preconditioned conjugate gradient algorithm, at iteration k we min- 
imize q over the span of the preceding search directions and the direction 
dk = r/c_i = b — Axk-i = — Vg(xfc), corresponding to the usual direction of 
steepest descent with respect to the i 2 Euclidean norm. On the other hand, if 
we choose d k — M~ l r k - 1 for a symmetric positive definite A/, we obtain the 
preconditioned conjugate gradient algorithm. However, note that lies 

along the direction of steepest descent with respect to the norm induced by 
the inner product (x.y)M = x T My. Thus, computing a direction of steepest 
descent with respect to an inner product other than the usual Euclidean inner 
product leads to the preconditioned conjugate algorithm. 

The connection between elements of the dual and directions in the domain 
given by (33) and (34) also allows us to give a sensible interpretation of the 
following aspect of the conjugate gradient algorithm. Suppose that A comes 
from a finite-difference discretization of 

-V • (aVu) = q on Q , , 

u = 0 on <9ft. 
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The matrix .4 : IR n — IR n approximates an infinite-dimensional operator .4 
that is a map A : ► Lr or A : Hq — > In the finite-dimensional case, 

we look for Xk in span{pi, * • ■ ,pk- i, dk }, where dk — b — Axk . But this does not 
make sense in terms of the underlying infinite-dimensional problem: dk lies in 
what should correspond to the range of .4, and the range and domain of .4 are 
not the same in this case. We can resolve this apparent inconsistency if we view 
dk as the solution of a steepest descent problem (33). 


7.2 The adjoint approach 

The adjoint approach is an approach to computing a direction of steepest de- 
scent. The point of view that we present here is that the adjoint approach is 
a no-holds-barred attempt to express the action of the derivative F'(a) in the 
following form: For some function g = g(a)< 

(F'{a). p) = J gp. (36) 

The goal of the adjoint approach is to find such a representation, if it exists. 

One reason such a representation of the derivative is convenient is that it 
suggests a direction of steepest descent and a choice of norm (scaling). If, for 
instance. g(a) E X and A' C I 2 , then g(a ) determines the direction of steepest 
descent in A' with respect to the L 2 norm: the Cauchv-Schwarz inequality says 
that the solution of 

minimize / gp 

P€A' J (37) 

subject to || p || L 2 < 1 

is -g/\\ 9 1 1 £ 2 • More importantly, as we will see in §8, a representation of the 
derivative in the form (36) makes it possible to compute the direction of steepest 
descent with respect to choices of norm other than the Lr norm. 

Having described the goal of the adjoint approach, we will now give an 
abstract description of its nature and then pass along to a concrete example. 
At this point the adjoint equations make their appearance, and we can clarify 
what is ‘’adjoint" about them. 

We start with (10) and play some notational tricks. Given r) E A*, 



Since 

du _ f dh \ dh 
da \ du ) da ’ 


du * df 
da du ’ 


9 


x 
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we have 

d^ dJ _ = _dh* (dh\-* d _i 

da du da \du J du 

The adjoint equation, represented by dh x /du, has now appeared. It is the 
adjoint of the linearized state relation — adjoint in the sense described in §2.1— 
and as such always exists. 

The solution operator for the adjoint problem is a map 


dh‘* 

du 


d -L e u' 

du 


-4 



x df_ 

du 


G V 


_d_h* (dh\~ X d_l d_h x (dh\*(dl 

da \du ) dv da \du ) du 

This yields the infinite-dimensional analog of (32): 


Ha)*? 


/a/ x _ (Oh. Y* \ 

\^da da \<9u/ du T 


(39) 


One hopes that when the dust clears, F r (a) has been revealed in the form (36). 

The adjoint approach also leads to an alternative expression for the costate 
A. From (9), A £ V* satisfies 





for all v £ U . However, 



allowing us to rewrite (39) as 


Ha)* 


d£ X 

da 




(40) 


(41) 


Also note that the adjoint equations can tell us how to compute an action 
of the Hessian of F on vectors. If we can identify p £ A* with elements of X* 
through a duality pairing such as (36), and if for all p £ A' we can identify 


du* _ _dh x /5/A-* 
da P da \<9uy ^ 


which is in A ', as an element of A’, then the adjoint equations tell us how to 
compute IF* according to (7). and the action of the Hessian of F via (12). 
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8 An illustration of the adjoint approach 

VVe will illustrate the adjoint approach using the example introduced in §3. We 
begin by computing the adjoint equation and the other adjoint operators that 
appear in (39). We then use these results to compute directions of steepest 
descent and the action of the Hessian. 


8.1 The adjoint equation and other adjoint operators 

Recall that dh/du maps r £ Hq to the linear functional in (HqY defined by 

Lv = -V • (aVv) + bid Xt v in Q 

v = 0 on dQ: 

that is. (dh/d u)v £ (HqY is defined by 

= j dx aVw • Vr 4- j dx (bid Xi v)w. 


for all in £ Hi 

The adjoint (dh/du) x maps w £ (HqY* % Hq to the linear functional in 
(HqY defined by 

L x w = —V * (gVuj) — d Xx (b{w) in Q . 

w = 0 on dQ. * * ' 

To see this adjointness, note that the definition of the adjoint and the reflexive 
identification of ( HqY ' and Hq means that 


— w, v ) = < w, —v 

OU / IT i \ uu 


v, w ) = { Lv . 


Meanwhile, the standard weak interpretation of (43) means that for all £ 

Hi. 

{L x u\ — I d.x aVw ■ Vv + wbid Xt v = (Lv, w) H i . 

0 Jq 0 

Thus (43) defines (dh/du) x . 

The operator (dh/du)’ x is the solution operator for the boundary value 
problem (43). Since (dh/du)’ 1 is a map (HqY Hi its adjoint (dh/du)~ x is 
a map ( HqY (#o)" ^ Hi which is again consistent with the interpretation 

of (43) as representing the weak formulation of a PDE. 

We also need to compute (dh/da) x as part of the adjoint calculation (39). 
For T} £ T°° we have 

= -V • (qVw) € [Hi)', 
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in the sense that for r E H$ we have 


dh 

da 


-rj. v ) = / dx r/Vu • Vt\ 

! /Hi 

We have - Vr £ (/- 00 ) / ; then from 

’• f”L ,)i -' 


we see that 


9fi x _ _ 

— = Vti ■ Vr\ 

aa 


Using (43) and (44) we can now compute 

du x _ _dh x (dh\~ X 

da V da \du ) 
We first compute the solution w of 


(44) 


L x w = —V • (aVtc) — d Tl [b{w) — v in Q 
w = 0 on dQ 


(45) 


to obtain w — (dh/du) x v . and then 

dh x ^ v- 

— — w = -V« • (46) 

9a 

yields (du/ofa) 

All these calculations and identifications (rather tediously) work with ad- 
joints in the sense of the definition in §2.1. This sense of adjointness is not that 
of an inner product space adjoint; the adjointness discussed for this example is 
certainly not that of a Hilbert space adjoint, for instance. One could attempt 
to interpret adjointness in this example in terms of the L 2 inner product, but 
such an interpretation would lead one to unbounded operators on L 2 and signifi- 
cant theoretical complications. The “adjoint 5 ' of the adjoint equations should be 
taken to refer to the adjoint that maps between dual spaces, just as in the theory 
of weak solutions of differential equations. Thus one avoids unbounded oper- 
ators. For observations on very similar difficulties with adjoints of unbounded 
operators to the solution of boundary value problems, see [16]. 


8.2 Directions of steepest descent 

For 

F(a) = f(a, u(a)) = \ f dx (u - u,f. 
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we have 


( 47 ) 


df . df f 

— = 0, —v = dz (u — u*)i/. 
da. du 

Keep in mind that, df /du = (u — u *) as a linear functional in the sense of (47). 

From (40), (43), and (47), the costate A £ (//g)" ^ 1S th e weak solution 
of 

L x A = -V * (oVA) - <9,, (M) = -(it - u.) in fi 

A = 0 on <9Q. (48) 


The regularity of solutions for the BVP means that we may think of A as an ele- 
ment of 77 q(Q), but its nature as a Lagrange multiplier in (Hq(Q)) is described 
via the canonical duality pairing 


< A * *W = <*- ■ *£( h oY 


that makes H isomorphic to {Hq)". Here again we encounter the issue of 
representations of linear functionals. 

From (41), 


f dh x 

F'(a)n = { A, rj 


Applying (44), we see that if we define 


g(x) = VA(x) * Vu(x) (49) 

then we arrive at the representation of F'(a) as 

F'(a)rj - f dx grj. (50) 

J n 


This integral representation achieves our first goal in the adjoint approach. This 
representation will allow us to compute the direction of steepest descent for a 
variety of norms, as we will now discuss. 

At this point the choice of domain A’ enters our deliberations. Suppose, 
as we have heretofore, that a £ A" = L°°(Q), and bi,q £ T°°. Then we are 
guaranteed in general only that u,A £ Hq , and so we can only be assured 
that the representer g defined in (49) is in L 1 . Thus — g does not immediately 
determine an I 2 direction of steepest descent because we do not know that g 
is, in fact, in L 2 . Without further hypotheses, we cannot simply take the result 
of applying the adjoint approach as a direction of steepest descent. 

However, given that g £ L 1 , we can compute the direction of steepest descent 
in the L°° norm; it is 

p{x) = -sign g(x). 

Unfortunately, this is not a particularly meaningful direction of steepest descent, 
and in the computational setting this is not particularly well-scaled. In IR n , the 
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unit ball in the norm contains points with f 2 norm y/n, so the two norms 
are quite dissimilar for large n. 

One of the problems one can encounter with the adjoint approach has emerged. 
Even if we can express the derivative in the form (36), the direction of steepest 
descent suggested by this representation may not be acceptable because of the 
regularity properties of the representer g. 

What happens if we try to improve the regularity of g by restricting attention 
to coefficients a that are smoother than just L 00 ? Well, if a £ A’ = C a {Q) and 
b is q £ , then u £ C lf0f (Q), and A £ H q, and so g £ L 2 . In this case, 

p = — gj || g \\ L2 would be the direction of steepest descent with respect to the 
L 2 norm. However, unless A £ the direction p may suffer from the flaw 

that p £ A = C^iQ). 

It can happen that A ^ C l a (fi) because the regularity of solutions of the 
adjoint problem (43) is slightly different from those of the state equation or its 
linearization, a situation not uncommon in the adjoint approach. In order to 
guarantee A £ C 1,c \ we must require not only the hypothesis a £ C Q but also 
6, £ C°\ This is because the differential operator associated with the adjoint 
contains the weak derivatives d Tt (biw), terms absent from the operator dh/du. 

Thus, in order to be assured that A £ C 1,Q (Q), we would need the additional 
regularity assumptions £ C a (fi). If these data do not satisfy these conditions, 
then the L 2 direction of steepest descent defined by (49) is not appropriate. 
Suppose it were the case that g £ L 2 but g ^ C a and we were to use p = 
— gj || g H^j, in the method of steepest descent, say. If our current iterate a c were 
in A' = C Q (n), then immediately we would produce an new iterate a+ = a c + ap 
that is not in A. In the computational setting, we could see a marked qualitative 
change appear in the step from a c to a + ; possibly “roughness’' (oscillations) or 
features of large magnitude. 

However, our difficulties go away if we compute a direction of steepest descent 
with respect to a higher-order Sobolev norm, say. the H 1 norm. We do this as 
follows. We seek a solution to the problem 

minimize (F'(a), p) = I dx gp 

Jn 

subject to | I P \\h' < 1- 

The Lagrangian for this problem is 

({p- p) = f dxgp+^f dx (Vp-Vp + p 2 ), 

Jn z Ja 

and the first-order necessary condition (which for this convex problem is suffi- 
cient) is 

— (p;p)p = dx grj + fi dx (Vp Vrj + p?/) = 0 
dp J n -hi 
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for all 77 E H l (Q ), with // > 0. But this condition is the same a s saving that p 
is the weak solution of the Neumann problem 


-V • (Vp) + p= -<7/// 



in Q 
on 


where p > 0 is chosen so that || p \\ Hl = 1. Thus, in order to compute the 
direction of steepest descent in the Z/ 1 - norm, we first need to compute g as 
in (49), and then solve this auxiliary Neumann problem. The regularity of 
solutions of elliptic problems is such that the resulting direction p is not only 
an element of H\ but also of C l,Q {Cl ). which is what we wished. 

For higher-order Sobolev norms, one would solve the weak form of an aux- 
iliary problem involving a higher-order operator. In this way one can obtain 
descent directions of ever increasing smoothness, the Sobolev norm acting as a 
preconditioner. In the computational setting, this would be done using a dis- 
crete Sobolev inner product as the weighting for the norm in the optimization 
algorithm. 


8.3 Computing the action of the Hessian 

Next we will compute the action of the Hessian and discuss its representation. 
From (12). V 2 F = W* (V 2 C) W, meaning 


V 2 F( V i^2) = V 2 C(Wri u W m ) = % 2 )). 

aa da 

We will see that to compute the action of the Hessian, we must solve two BVP. 
one of the form (42) and the other of the form (43). 

For i — 1,2, let 


We have 


V 2 /((t?i. vi). (m^2)) 


-i 


. /du 

dx viv 2 = ( — r/x, 


du \ j du x du 

da^f H \ \da da ^ 1t 




while 

F 2 / 7 ((?/i.z/i), (772^2)) = -V • - V • (772 Vz^i) 


in (Hq)', and 


(A, D 2 h[{t) 1,1/1), 

= / dx rj \ VA Vi/o -f dx 7 ) 2 VA • Vv\ 

Jn Jn 
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= <-V ■ (T|1 VA), v 2 ) H i + (VA • Vi/i , TJ 2 ) L oc 
— ^~V ■ (r)i VA), ^ +(VA - Wj. t / 2 ) L3 

(7/1 VA), 7/2 y +{VA-Vi/i, rjo) L ^. 

/ L°° 


/^ X - 
\ da 


Then 


^ 2 ^((^i > ^1)' ( 7 ?2- ^2)) = / dx 1/ 1V2+ / dx r]i VA * Vt/ 2 4 - / dr 772 V A 
Jrt J n Jfi 

or in terms of the various linear functionals, 

0 / d u x du du* — . —du \ 

V*f((77i^,),(»? 2 .^)) = ^ ^ + ^ (»?iVA) + VA-V— >?i, 7 / 2 ) 


V^i. 


If we let 


du x du dn x —du 

o=— TVi + — (»?iVA) + VA- V— rn- 
da da 


da 


da 


then we see that 0 6 L 1 and 


<V 2 F 771, 772) = I <S> 772. 


( 51 ) 


giving us an integral representation of the action of the Hessian on 771 . As in the 
case of the representation ( 50 ) of the derivative, the choice of domain A" and 
the smoothness of the other data in the problem will determine whether <f> £ L 2 
or is even more regular. 


9 Further observations on the adjoint approach 
and the representation of the derivative and 
Hessian 

A natural question to ask is when F f (a) can be represented in the form ( 36 ). 
Obviously ( 36 ) is natural for a problem posed on I 2 , such as many control 
problems, since then the Riesz Representation Theorem for Hilbert spaces tells 
us that there exists g £ L 2 for which (F'(a), p) = (g , p ) L2 • However, many 
problems, such as parameter estimation problems, are not usually posed a priori 
on a Hilbert space such as L 2 — there are typically boundedness or regularity 
constraints on the coefficients in differential operators. So, how common should 
we expect the representation ( 36 ) to be? 

The following observation might make us hopeful that the derivative gener- 
ally can be expressed in the form ( 36 ). Suppose the domain A\ whatever its 
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natural topology, is a subset of the Sobolev space H k for some k > 0, and the 
derivative F'(a) is actually a continuous linear functional in the norm of H k : 
for some C > 0, 

p)\<C\\p\\ Hk (52) 

for all p £ X. Using the Hahn-Banach Theorem we can extend F'(a) to a 
bounded linear functional on all of H k . We may identify the dual of H k with 
the negative norm Sobolev space H~ k [2]. This characterization of ( H k )' differs 
from that given by the Riesz Representation Theorem in terms of the H k inner 
product: H~ k is defined to be the completion of the space of functionals v on 
H k of the form 

(t\ p) h -„ = J vp, peH k , (53) 

for some v G L 2 . The completion is taken with respect to the norm 

IMI-k = su p I(v.p)l*I- 

\\ p\\h *< 1 

If (52) holds, then F f (a) G H~ k , and since the functionals of the form (53) 
are dense in H~ k , we might hope that we will be able to express F'(a) in the 
desired form (36), or at the very least approximate it by such simple functionals 
for which it is trivial to compute a direction of steepest descent. Moreover, 
functionals of the form (53) are also dense in the duals of other spaces of interest, 
such as C k . 

Unfortunately, the following elementary proposition points out that our hope 
for finding a representation of F'{a) of the form (36) and an associated L 2 
direction of steepest descent is circumscribed. No cheating is allowed: If one 
has a representation of F'(a) of the form (36), and this representation is well- 
behaved in the sense that the representer g(a) can be used to determine an Lr 
direction of steepest descent that behaves reasonably as a function of a. then 
morally the problem can actually be posed on L 2 to begin with. 

Proposition 9.1. Let X and H be Banach spaces such that X C H . Let 
S be a convex subset of X and denote by S the closure of S in H in the norm 
on H . 

Also suppose that F : S IR is continuously differentiable in the topology 
of X and that for all a G S and p G X . 

(F’(a), r)) x = (g(a), rj) H , 

where g(a) G H f is bounded in norm as a function of a on subsets of X bounded 
in the norm on H . Then F extends to a map F : S — > JR continuous in H . 

Proof. For 6, c G #(0, R) fl S we have 

F(b) - F(a) = ( F'(c ). b - a) x = (g(c), b - a)„ 
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for some c £ 5 on the line segment connecting a and 6, so 

I F(b) - F(a) | < || 5 (c) \\ H || b - a ||„ < h R \\ b - a \\ H 

where Kr depends only on R . This shows that F is continuous on S in the 
topology of //, so we can extend F uniquely to a map F : Sfl F(0. R) — » F? 
continuous in the norm on //. Since R > 0 was arbitrary, the proposition 
follows. □ 

Suppose that we either express F'(a) as a functional of the form (53). or 
approximate it by such a functional (as the density of such functionals in many 
dual spaces might lead us to try). Then Proposition 9.1 says that either F 
extends to L 2 , or the representer v(a) cannot even be bounded in norm on 
sets bounded in L~ norm, much less be continuous. In the latter case, when 
F does not extend to L 2 , the representer produced by the adjoint approach 
is not by itself a meaningful representation of sensitivities or a direction of 
steepest descent. In nonlinear programming terms, the descent promised by 
such a putative direction of descent is not meaningful since the function F is 
extremely nonlinear with respect to the sense of distance. In the computational 
setting, this means that the usual direction of steepest descent with respect to 
the Euclidean norm, i.e., the negative gradient of the discretized problem, may 
have less and less meaning as the discretization becomes finer. 

The conjugate gradient method applied to the BVP (35) in §7 manifests 
this pathology. The infinite-dimensional operator .4 does not extend to L 2 , so 
we should not expect a direction of descent computed with respect to the Lr 
norm to be useful. The un-preconditioned conjugate gradient algorithm uses 
approximations of exactly these bad directions of descent and generally does 
not work well. For a fine discretization, the quadratic form is too nonlinear in 
the f 2 norm for the £ 2 direction of steepest descent to be a useful predictor of 
the decrease we will see in that search direction. 


10 Conclusion 

One topic we have not discussed in this paper has been the practical details of the 
implementation of sensitivity calculations for problem governed by differential 
equations, particularly the adjoint approach. This is a large topic in its own 
right, and there is a great deal of disagreement particularly over how the adjoint 
approach should be implemented. One point of view is to derive the adjoint 
equations in the infinite-dimensional setting and then discretize them as seen 
fit. At the other end of the spectrum is the approach that works purely with 
the discretized problem, and computes the associated derivatives. Automatic 
differentiation is the extreme of this point of view; not only the discretized state 
equation but its solution scheme is differentiated. Intermediate to these points 
of view is one that works with the elements of the discretized problems in ways 
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that are analogous to how one approaches the infinite-dimensional sensitivity 
calculation. 

Our overview has emphasized the origin of sensitivity calculations in im- 
plicit differentiation, and the connection between the sensitivity formulae and 
variable reduction methods in nonlinear programming. We have stressed the 
distinction between the derivative and directions of steepest descent as the key 
to understanding the object and limitations of the adjoint approach. We hope 
this perspective on the calculation of sensitivities for problems governed by dif- 
ferential equations and other state equations will make discussion easier between 
nonlinear programmers and those interested in the application of optimization 
to their specific problems. 

The interpretation of the adjoint equations in terms of the Banach space 
adjoint we have discussed is general. The example of the adjoint approach 
given in this paper considered a problem involving weak solutions of the gov- 
erning differentia] equation, but the ideas apply in the case of classical or strong 
solutions. 

It is not always possible to express the derivative in the form (36). This 
somet imes occurs, for instance, with objectives F that involve traces of the state 
u — restrictions of u to lower-dimensional surfaces — because the trace operation 
makes df/du a distribution. This distribution shoves up on the right-hand 
side of the adjoint problem, and the solution of the adjoint problem may be a 
distribution that is not a function in the usual sense. In such cases, computing 
a direction of steepest descent with a norm other than that of L 2 . such as the 
choice of a Sobolev norm discussed in §8.2, will produce a smoother representer 
for F\ which, if sufficiently regular, may serve as a direction of steepest descent. 

Computationally, the appearance of a distribution on the right-hand side 
of the adjoint problem corresponds, say, to taking a function defined on the 
boundary of a computational grid and injecting it into the interior as a function 
that is supported only near the boundary. Computing a direction of steepest 
descent with respect- to a Sobolev norm smoothes out this data. 

Also note that applying the implicit function theorem to compute derivatives 
for problems involving traces requires that we know that solutions of the state 
equation are sufficiently smooth for the trace map to be continuous. An example 
of a problem for which such trace theorems had to be derived as part of the 
sensitivity analysis can be found in [19]. 

One could choose to view the question of norms and scaling that we have 
discussed as a bogeyman from functional analysis and infinite-dimensional op- 
timization. However, if one is attempting to use approximate a truly infinite- 
dimensional optimization problem via discretization, then the issue of scaling 
and the dependence of the direction of steepest descent on the choice of norm 
will become manifest as the level of discretization increases, as our discussion in 
connection with the conjugate gradient algorithm indicates. Even when consid- 
ering the case where the design variables a truly reside in a finite-dimensional 
domain, one needs to be aware of the issue of scaling. Moreover, when im- 
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plementing an adjoint approach in either case one will need to understand the 
nature of the intermediate quantities. 
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11 Appendix: Some results from operator the- 
ory 

Relegated to this appendix are some results on operators that are used in connec- 
tion with the reduced Hessian in Theorem 2.2. These results are identifications 
that allow us to make t he general formula for the reduced Hessian look like the 
familiar one in ZR" . 

Given Banach spaces V. Z , we will denote by B(Y , Z) the space of bounded 
bilinear maps from Y into Z. Then we have the following equivalences. 

11.1 An isomorphism of the space of bilinear maps 

There is a natural isomorphism between L{X, L[U, V’)) and B(X x U.V), the 
space of bilinear maps from X x U into l\ Given A £ L(X,L(U, V )), we may 
define a bilinear map B(x,u) = {Ax, u). Conversely, given a bilinear map 
B . A'x U -*• V, we can define A £ L( X, L(U, V)) via {Ax, u ) = B(x, u). 

11.2 Second derivatives as bilinear maps 

The derivative of a map $ : Y Z is a map D<b : y D${ y) £ L(Y, Z), so 
its derivative. D 7 $. is a map D 2 $ : y | — >- D 2 ${y) £ L(Y, L(Y, Z)). Lsing the 
identification in §11.1, we may then canonically view D 2 4> as a bilinear map in 
B(Y x Y.Z ). 

11.3 The adjoint of a bilinear form 

A bilinear form B on IR n x IR m has the form B(x,u) = x T Bu = u T B T x for 
some n x m matrix B. We may view B as mapping ZR" to linear functionals 
(row vectors) in via B : x t-y x T B, and B T as mapping ZR' 1 to linear 

functionals in (ZR' 1 )' via B T : v it T B T . 

The general analog is the following. Suppose that B\ : X x U IR and 
Bn : V x X -4 ZR are bounded bilinear forms and that B\(x,u) = B^iu.x) 
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for all x.u. Using the identification of §11.1, we have B\ E B{ X x U, IR) % 
L(X.L{U, IR)) = L(X. J7'). Likewise, we have Bo E L(U,X '), and 

<Rix, r<) — {R 2 u, x) ■ 

Then : U" -> X' and R 2 X : X" -4 U'. Since there is a natural embedding 
U C t r ". we may view' £ x as a map B* : U -> A''. Likewise, we may view' R 2 
as a map B% : A* — > £7', as desired. 

11.4 Composition of linear maps and bilinear forms 

Given a bilinear form B(x, u) = x T Bu — u T B T x on lR n x ZR m , then 

B(A\Xi, A 2 X 0 ) — x\ A 2 BA\X\ — BA\{x\, X 2 ) 

= xJaJB t A 2 x 2 = AJB t A 2 (x 2 . Xi ) 

where we are defining the bilinear forms A? BA\(x\, x 2 ) and A[ B T A 2 {x 2 ,x\) 
in the obvious w^ay. 

The general analog is derived similarly. Suppose that B : X x U — > IR is a 
bilinear form, A\ : Aj —¥ A , and Ao : A 2 A. Then using the interpretation 
in §11.3 of B x : U -4 X* we have 

B{A\X\. .425*2) = (A* B* Ao) {x 2 )(x\) = (A? BA\) (xi)(x 2 ) (54) 
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